Video Games Exploration by Anna Fedotova

This project explores the Video Games Sales with Ratings dataset from Kaggle.

It combines data from a web scrape of VGChartz video game sales with a web scrape of Metacritic, which provides the game ratings. Some observations are incomplete because Metacritic covers only a subset of the platforms. There are approximately 6,900 complete cases.

Data Cleaning

The dataset has 16719 observations and 16 variables:

## [1] 16719    16

Let’s summarize the data to see whether there are any missing values.

##                           Name          Platform    Year_of_Release
##  Need for Speed: Most Wanted:   12   PS2    :2161   2008   :1427   
##  FIFA 14                    :    9   DS     :2152   2009   :1426   
##  LEGO Marvel Super Heroes   :    9   PS3    :1331   2010   :1255   
##  Madden NFL 07              :    9   Wii    :1320   2007   :1197   
##  Ratatouille                :    9   X360   :1262   2011   :1136   
##  Angry Birds Star Wars      :    8   PSP    :1209   2006   :1006   
##  (Other)                    :16663   (Other):7284   (Other):9272   
##           Genre                             Publisher    
##  Action      :3370   Electronic Arts             : 1356  
##  Sports      :2348   Activision                  :  985  
##  Misc        :1750   Namco Bandai Games          :  939  
##  Role-Playing:1500   Ubisoft                     :  933  
##  Shooter     :1323   Konami Digital Entertainment:  834  
##  Adventure   :1303   THQ                         :  715  
##  (Other)     :5125   (Other)                     :10957  
##     NA_Sales          EU_Sales         JP_Sales        Other_Sales      
##  Min.   : 0.0000   Min.   : 0.000   Min.   : 0.0000   Min.   : 0.00000  
##  1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.00000  
##  Median : 0.0800   Median : 0.020   Median : 0.0000   Median : 0.01000  
##  Mean   : 0.2633   Mean   : 0.145   Mean   : 0.0776   Mean   : 0.04733  
##  3rd Qu.: 0.2400   3rd Qu.: 0.110   3rd Qu.: 0.0400   3rd Qu.: 0.03000  
##  Max.   :41.3600   Max.   :28.960   Max.   :10.2200   Max.   :10.57000  
##                                                                         
##   Global_Sales      Critic_Score    Critic_Count      User_Score  
##  Min.   : 0.0100   Min.   :13.00   Min.   :  3.00          :6704  
##  1st Qu.: 0.0600   1st Qu.:60.00   1st Qu.: 12.00   tbd    :2425  
##  Median : 0.1700   Median :71.00   Median : 21.00   7.8    : 324  
##  Mean   : 0.5335   Mean   :68.97   Mean   : 26.36   8      : 290  
##  3rd Qu.: 0.4700   3rd Qu.:79.00   3rd Qu.: 36.00   8.2    : 282  
##  Max.   :82.5300   Max.   :98.00   Max.   :113.00   8.3    : 254  
##                    NA's   :8582    NA's   :8582     (Other):6440  
##    User_Count          Developer        Rating    
##  Min.   :    4.0            :6623          :6769  
##  1st Qu.:   10.0   Ubisoft  : 204   E      :3991  
##  Median :   24.0   EA Sports: 172   T      :2961  
##  Mean   :  162.2   EA Canada: 167   M      :1563  
##  3rd Qu.:   81.0   Konami   : 162   E10+   :1420  
##  Max.   :10665.0   Capcom   : 139   EC     :   8  
##  NA's   :9129      (Other)  :9252   (Other):   7

There are missing values in Critic_Score, Critic_Count and User_Count (reported as NA's), and User_Score contains blank and “tbd” entries. Let’s remove the rows with missing values.

##                                           Name         Platform   
##  Madden NFL 07                              :   9   PS2    :1161  
##  LEGO Star Wars II: The Original Trilogy    :   8   X360   : 881  
##  Need for Speed: Most Wanted                :   8   PS3    : 790  
##  Harry Potter and the Order of the Phoenix  :   7   PC     : 703  
##  LEGO Batman: The Videogame                 :   7   XB     : 581  
##  LEGO Indiana Jones: The Original Adventures:   7   Wii    : 492  
##  (Other)                                    :6971   (Other):2409  
##  Year_of_Release          Genre                            Publisher   
##  2008   : 595    Action      :1677   Electronic Arts            : 957  
##  2007   : 590    Sports      : 973   Ubisoft                    : 500  
##  2005   : 562    Shooter     : 886   Activision                 : 498  
##  2009   : 554    Role-Playing: 721   Sony Computer Entertainment: 316  
##  2006   : 528    Racing      : 598   THQ                        : 309  
##  2003   : 499    Platform    : 407   Nintendo                   : 294  
##  (Other):3689    (Other)     :1755   (Other)                    :4143  
##     NA_Sales          EU_Sales          JP_Sales        Other_Sales      
##  Min.   : 0.0000   Min.   : 0.0000   Min.   :0.00000   Min.   : 0.00000  
##  1st Qu.: 0.0600   1st Qu.: 0.0200   1st Qu.:0.00000   1st Qu.: 0.01000  
##  Median : 0.1500   Median : 0.0600   Median :0.00000   Median : 0.02000  
##  Mean   : 0.3893   Mean   : 0.2331   Mean   :0.06295   Mean   : 0.08153  
##  3rd Qu.: 0.3900   3rd Qu.: 0.2100   3rd Qu.:0.01000   3rd Qu.: 0.07000  
##  Max.   :41.3600   Max.   :28.9600   Max.   :6.50000   Max.   :10.57000  
##                                                                          
##   Global_Sales      Critic_Score    Critic_Count      User_Score  
##  Min.   : 0.0100   Min.   :13.00   Min.   :  3.00   7.8    : 298  
##  1st Qu.: 0.1100   1st Qu.:62.00   1st Qu.: 14.00   8      : 267  
##  Median : 0.2900   Median :72.00   Median : 24.00   8.2    : 267  
##  Mean   : 0.7671   Mean   :70.25   Mean   : 28.78   8.5    : 245  
##  3rd Qu.: 0.7500   3rd Qu.:80.00   3rd Qu.: 39.00   7.5    : 240  
##  Max.   :82.5300   Max.   :98.00   Max.   :113.00   7.9    : 240  
##                                                     (Other):5460  
##    User_Count                 Developer        Rating    
##  Min.   :    4.0   EA Canada       : 152   T      :2420  
##  1st Qu.:   11.0   EA Sports       : 145   E      :2118  
##  Median :   27.0   Capcom          : 128   M      :1459  
##  Mean   :  173.4   Ubisoft         : 104   E10+   : 946  
##  3rd Qu.:   89.0   Konami          : 100          :  70  
##  Max.   :10665.0   Ubisoft Montreal:  88   RP     :   2  
##                    (Other)         :6300   (Other):   2
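The row removal above can be sketched in base R as follows. The report does not show the code that was used, so this is only one plausible way to do it. The toy `games` data frame and its values are made up for illustration, and treating “tbd” and blank User_Score entries as missing is an assumption.

```r
# Toy stand-in for the real data; the values are made up for illustration.
games <- data.frame(
  Name         = c("A", "B", "C", "D"),
  Critic_Score = c(76L, NA, 80L, 58L),
  Critic_Count = c(51L, NA, 73L, 41L),
  User_Score   = c("8", "tbd", "", "6.6"),
  User_Count   = c(322L, NA, NA, 129L),
  stringsAsFactors = TRUE
)

# "tbd" and "" are ordinary factor labels, not NA, so recode them first
# (assumption: both should count as missing).
games$User_Score[games$User_Score %in% c("tbd", "")] <- NA

# Keep only the rows that have no NA in any column.
games_complete <- games[complete.cases(games), ]
nrow(games_complete)
```

Note that `summary()` only reports `NA's` for true missing values, which is why blank strings in factor columns such as Rating can survive this kind of filtering.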

Let’s check the data types of the columns to make sure they are correct.

## 'data.frame':    7017 obs. of  16 variables:
##  $ Name           : Factor w/ 11563 levels "","'98 Koshien",..: 11059 5573 11061 6693 11057 6696 5572 11051 4966 11052 ...
##  $ Platform       : Factor w/ 31 levels "2600","3DO","3DS",..: 26 26 26 5 26 26 5 26 29 26 ...
##  $ Year_of_Release: Factor w/ 40 levels "1980","1981",..: 27 29 30 27 27 30 26 28 31 30 ...
##  $ Genre          : Factor w/ 13 levels "","Action","Adventure",..: 12 8 12 6 5 6 8 12 5 12 ...
##  $ Publisher      : Factor w/ 582 levels "10TACLE Studios",..: 371 371 371 371 371 371 371 371 330 371 ...
##  $ NA_Sales       : num  41.4 15.7 15.6 11.3 14 ...
##  $ EU_Sales       : num  28.96 12.76 10.93 9.14 9.18 ...
##  $ JP_Sales       : num  3.77 3.79 3.28 6.5 2.93 4.7 4.13 3.6 0.24 2.53 ...
##  $ Other_Sales    : num  8.45 3.29 2.95 2.88 2.84 2.24 1.9 2.15 1.69 1.77 ...
##  $ Global_Sales   : num  82.5 35.5 32.8 29.8 28.9 ...
##  $ Critic_Score   : int  76 82 80 89 58 87 91 80 61 80 ...
##  $ Critic_Count   : int  51 73 73 65 41 80 64 63 45 33 ...
##  $ User_Score     : Factor w/ 97 levels "","0","0.2","0.3",..: 79 82 79 84 65 83 85 76 62 73 ...
##  $ User_Count     : int  322 709 192 431 129 594 464 146 106 52 ...
##  $ Developer      : Factor w/ 1697 levels "","10tacle Studios",..: 1035 1035 1035 1035 1035 1035 1035 1035 621 1035 ...
##  $ Rating         : Factor w/ 9 levels "","AO","E","E10+",..: 3 3 3 3 3 3 3 3 3 3 ...

User Score and Year of Release are factors and should be converted to numeric.

##  [1] "1980" "1981" "1982" "1983" "1984" "1985" "1986" "1987" "1988" "1989"
## [11] "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999"
## [21] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [31] "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017" "2020" "N/A"

Year of Release also contains “N/A” entries. Let’s remove those rows and then convert User Score and Year of Release to a numeric data type.

## 'data.frame':    6894 obs. of  16 variables:
##  $ Name           : Factor w/ 11563 levels "","'98 Koshien",..: 11059 5573 11061 6693 11057 6696 5572 11051 4966 11052 ...
##  $ Platform       : Factor w/ 31 levels "2600","3DO","3DS",..: 26 26 26 5 26 26 5 26 29 26 ...
##  $ Year_of_Release: num  2006 2008 2009 2006 2006 ...
##  $ Genre          : Factor w/ 13 levels "","Action","Adventure",..: 12 8 12 6 5 6 8 12 5 12 ...
##  $ Publisher      : Factor w/ 582 levels "10TACLE Studios",..: 371 371 371 371 371 371 371 371 330 371 ...
##  $ NA_Sales       : num  41.4 15.7 15.6 11.3 14 ...
##  $ EU_Sales       : num  28.96 12.76 10.93 9.14 9.18 ...
##  $ JP_Sales       : num  3.77 3.79 3.28 6.5 2.93 4.7 4.13 3.6 0.24 2.53 ...
##  $ Other_Sales    : num  8.45 3.29 2.95 2.88 2.84 2.24 1.9 2.15 1.69 1.77 ...
##  $ Global_Sales   : num  82.5 35.5 32.8 29.8 28.9 ...
##  $ Critic_Score   : int  76 82 80 89 58 87 91 80 61 80 ...
##  $ Critic_Count   : int  51 73 73 65 41 80 64 63 45 33 ...
##  $ User_Score     : num  79 82 79 84 65 83 85 76 62 73 ...
##  $ User_Count     : int  322 709 192 431 129 594 464 146 106 52 ...
##  $ Developer      : Factor w/ 1697 levels "","10tacle Studios",..: 1035 1035 1035 1035 1035 1035 1035 1035 621 1035 ...
##  $ Rating         : Factor w/ 9 levels "","AO","E","E10+",..: 3 3 3 3 3 3 3 3 3 3 ...
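One thing to watch when converting factors: the numeric User_Score values above (79 82 79 84 65 ...) are identical to the internal factor codes shown in the earlier structure output, which suggests that `as.numeric()` may have been applied to the factor directly. Below is a minimal sketch of the difference, using made-up toy vectors rather than the real columns.

```r
# Toy factors standing in for the real columns (values are made up).
user_score <- factor(c("8", "8.3", "6.6", "8"))
year       <- factor(c("2006", "2008", "N/A", "2009"))

# Applied directly to a factor, as.numeric() returns the level codes,
# not the labels: here 2 3 1 2.
codes <- as.numeric(user_score)

# Drop the rows with an unknown year, then convert via character
# so that the labels themselves are parsed as numbers.
keep           <- year != "N/A"
user_score_num <- as.numeric(as.character(user_score[keep]))
year_num       <- as.numeric(as.character(year[keep]))
```

If the scores should be on the same 0 to 100 scale as Critic_Score, multiplying the converted values by 10 would be one option; whether that was intended here is not stated.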

Univariate Plots Section

Let’s look at the distributions of some of the variables.

Distribution of Global Sales

The distribution of Global Sales has a long right tail, but on a logarithmic scale it looks approximately normal.
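The effect of the log transformation can be illustrated with a small sketch. The data here is simulated from a log-normal distribution, not taken from the dataset, and base graphics are used only to keep the example free of package dependencies (the original plots may well have been drawn with another library).

```r
set.seed(1)
# Simulated long-tailed "sales" in millions of copies; not the real data.
global_sales <- rlnorm(1000, meanlog = -1.2, sdlog = 1.3)

# Side-by-side histograms: raw scale versus log10 scale.
op <- par(mfrow = c(1, 2))
hist(global_sales, breaks = 50,
     main = "Raw scale", xlab = "Global sales (millions)")
hist(log10(global_sales), breaks = 50,
     main = "log10 scale", xlab = "log10(Global sales)")
par(op)
```

On the raw scale the mean sits well above the median, which is typical for a long right tail; after the transformation the two are close together.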

Let’s look at the sales distribution by region.

Global Sales by Region

Looking at the distribution of sales by region, most of the sales in the dataset come from the North American market. This makes sense, since the subset we are looking at includes only games that have a rating on Metacritic.com, a website with a primarily American audience.

Another observation is that many games have sales close to 0 in the markets outside North America, which shows up as a tall bar at the left edge of the histograms.

Distribution by Year of Release

Most of the games in the dataset were released between 2000 and 2015. However, games with the highest median global sales were released before 2000.

Distribution by Genre

The most represented genre in the dataset is Action, followed by Sports and Shooter. In terms of median sales per game, the most popular genres are Sports and Miscellaneous, followed by Platform, Shooter and Fighting.

Distribution by Platform

Sony consoles lead in terms of median sales per game (PS, PS3 and PS2). There is no clear relationship between the number of games produced for a platform and the median sales per game on that platform. For example, the newest consoles from Nintendo (WiiU) and Microsoft (XOne) don’t have many games released yet, but their median sales per game are quite high, whereas PC games are abundant but generate very few sales (one possible explanation is that PC games are more prone to being pirated).

Let’s add a new variable called “Bestseller” for games that sold a million or more copies. Let’s look at the top Publishers and Developers in terms of total sales and see how many bestseller games they have in their portfolio.
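A minimal base R sketch of this step is shown here. The toy data and the helper name `summarise_group` are hypothetical. The column names mirror those in the tables that follow, and `best_share` is computed as the percentage of a group's games that are bestsellers, which matches the published figures (for example 273 / 945 = 28.9%).

```r
# Toy data; publishers and sales (in millions of copies) are made up.
games <- data.frame(
  Publisher    = c("P1", "P1", "P1", "P2", "P2"),
  Global_Sales = c(2.50, 0.40, 1.00, 0.30, 0.10)
)

# Bestseller flag: 1 if a game sold a million copies or more, else 0.
games$Bestseller <- as.integer(games$Global_Sales >= 1)

# Hypothetical helper: one summary row per group.
summarise_group <- function(d) {
  data.frame(
    total_sales  = sum(d$Global_Sales),
    median_sales = median(d$Global_Sales),
    n            = nrow(d),
    bestsellers  = sum(d$Bestseller),
    best_share   = round(100 * mean(d$Bestseller), 1)
  )
}

# Split by publisher, summarise each group, and sort by total sales.
by_pub <- do.call(rbind, lapply(split(games, games$Publisher), summarise_group))
by_pub <- by_pub[order(-by_pub$total_sales), ]
head(by_pub, 5)
```

The same function can be reused for developers by splitting on a `Developer` column instead.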

Top 5 game publishers in terms of total Global Sales

## # A tibble: 5 x 6
##   Publisher          total_sales median_sales     n bestsellers best_share
##   <fct>                    <dbl>        <dbl> <int>       <dbl>      <dbl>
## 1 Electronic Arts           869.         0.53   945         273       28.9
## 2 Nintendo                  850.         1.03   293         149       50.8
## 3 Activision                536.         0.45   492         125       25.4
## 4 Sony Computer Ent~        388.         0.46   316          92       29.1
## 5 Take-Two Interact~        350.         0.44   273          77       28.2

The biggest game Publishers are Electronic Arts and Nintendo, each having sold more than 800 million copies. While Electronic Arts stands out for the number of published games (945), Nintendo has published far fewer (293) but sold about twice as many copies per title at the median (1.03 million versus 0.53 million). The same holds for bestsellers: each of the other top 5 publishers has a bestseller ratio of 25 to 29%, while 51% of Nintendo’s portfolio consists of bestsellers.

Top 5 game developers in terms of total Global Sales

## # A tibble: 5 x 6
##   Developer      total_sales median_sales     n bestsellers best_share
##   <fct>                <dbl>        <dbl> <int>       <dbl>      <dbl>
## 1 Nintendo              530.         3.23    68          52       76.5
## 2 EA Sports             146.         0.6    142          46       32.4
## 3 EA Canada             131.         0.48   149          41       27.5
## 4 Rockstar North        119.         7.99    14          11       78.6
## 5 Capcom                115.         0.36   126          34       27.0

The top developers are Nintendo and Electronic Arts (EA Sports and EA Canada are both divisions of Electronic Arts). What stands out is that Nintendo is even more successful as a developer than as a publisher, with a median of 3.2 million copies sold per title and 76.5% of its portfolio being bestsellers. Another developer that stands out for its bestseller rate is Rockstar North, the developer of the Grand Theft Auto franchise: of the 14 games it developed, 11 became bestsellers (79%), with a median of 8 million copies sold per title.

Distribution of Critic Scores and User Scores

Let’s analyse the distribution of Critic Scores and User Scores.

Critic Score Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   62.00   72.00   70.26   80.00   98.00

User Score Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   64.00   74.00   70.84   81.00   95.00

Both the Critic Score and User Score distributions are skewed left. Curiously, while User Scores are slightly higher on average, Critics tend to give more extremely positive scores and Users tend to give more extremely negative ones (the long left tail of the distribution).

Univariate Analysis

Dataset Structure

Initially the dataset consisted of 16719 observations and 16 variables.

However, because of missing values, part of the observations was removed; the final number of complete observations is 6894.

Several data types were also adjusted so that further analysis runs smoothly.

The following additional variables were created:

  • Manufacturer, factor (Platform Manufacturer) - makes it possible to group platform data by console manufacturer
  • Bestseller, binary - 0 for games that sold less than 1 million copies, 1 for games that sold 1 million copies or more.
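The report does not show how Manufacturer was derived, so the following is only a sketch of one way to do it with a named lookup vector. The lookup name `manufacturer_of` is hypothetical, and so are the choices to keep PC as its own group and to cover only these platform codes.

```r
# A few platform codes as they appear in the dataset.
platform <- factor(c("Wii", "PS2", "X360", "PC", "DS", "PS4", "XOne", "DC"))

# Hypothetical lookup from platform code to manufacturer.
manufacturer_of <- c(
  Wii = "Nintendo", WiiU = "Nintendo", DS = "Nintendo",
  `3DS` = "Nintendo", GBA = "Nintendo", GC = "Nintendo",
  PS = "Sony", PS2 = "Sony", PS3 = "Sony",
  PS4 = "Sony", PSP = "Sony", PSV = "Sony",
  XB = "Microsoft", X360 = "Microsoft", XOne = "Microsoft",
  DC = "Sega",
  PC = "PC"  # assumption: PC is kept as a separate group
)

# Index the lookup by the platform labels (not the factor codes).
manufacturer <- factor(unname(manufacturer_of[as.character(platform)]))
table(manufacturer)
```

Any platform missing from the lookup would come out as NA, which makes gaps in the mapping easy to spot.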

Features of interest

The data can be used to predict either the volume of game sales or whether a specific game will become a bestseller. Depending on the problem formulation, the target variable can be either the number of copies sold (Global_Sales) or Bestseller (in this case the target is binary: a game is either a Bestseller or not).

Global Sales has a long tail distribution, which is why graphs that include sales will be represented on a logarithmic scale.

The potential predictor variables are:

  • Genre
  • User Score
  • Critic Score
  • Platform
  • Manufacturer
  • Publisher
  • Developer
  • Year of Release

The dataset covers games released mostly between 2000 and 2015. The data itself was last updated in December 2016.

Bivariate Plots Section

In this section we analyse in more detail the possible relationships among the variables explored in the first section.

Relationship between Games Genre and Sales

As we saw in the previous section, some genres have higher global sales than others. We can also see that some genres show more variance in sales (for example Simulation), whereas others have a narrower spread (Adventure).

## [[1]]
## NULL
## 
## $Action
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1200  0.2900  0.7334  0.7300 21.0400 
## 
## $Adventure
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.0600  0.1300  0.3088  0.2900  5.5400 
## 
## $Fighting
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1300  0.3300  0.6597  0.7700 12.8400 
## 
## $Misc
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1700  0.3800  1.0806  0.9875 28.9200 
## 
## $Platform
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1200  0.3500  0.9375  0.9450 29.8000 
## 
## $Puzzle
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.0800  0.1400  0.6686  0.5600 15.2900 
## 
## $Racing
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1100  0.2700  0.8167  0.7700 35.5200 
## 
## $`Role-Playing`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1000  0.2600  0.7023  0.7000  9.7200 
## 
## $Shooter
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1100  0.3400  0.9412  0.9025 14.7300 
## 
## $Simulation
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.0800  0.3000  0.6763  0.7175 12.1300 
## 
## $Sports
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1600  0.3800  0.8787  0.8550 82.5300 
## 
## $Strategy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.0400  0.0900  0.2521  0.2775  4.8400
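Per-genre summaries like the ones above can be produced with `tapply()`. The sketch below uses made-up toy data. It also illustrates, as an assumption about the cause, why a leading `NULL` entry can appear: a factor level that is still declared (here the empty string) but no longer has any rows yields an empty cell, and `droplevels()` removes it.

```r
# Toy data; the empty-string level is declared but has no rows.
games <- data.frame(
  Genre = factor(c("Action", "Action", "Sports", "Sports", "Puzzle"),
                 levels = c("", "Action", "Puzzle", "Sports")),
  Global_Sales = c(0.29, 0.73, 0.38, 0.85, 0.14)
)

# One summary per level; the unused "" level gives a NULL entry.
by_genre_raw <- tapply(games$Global_Sales, games$Genre, summary)

# Dropping unused levels first avoids the empty entry.
games$Genre <- droplevels(games$Genre)
by_genre    <- tapply(games$Global_Sales, games$Genre, summary)
by_genre
```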

Genre preferences by market

Let’s see whether genre preferences stay the same if we split the sales data by market.

Whereas the North American and European markets are somewhat similar in terms of best-selling genres, the Japanese market shows different tendencies. There the best-selling genres are Role-Playing and Puzzle, whereas the worst-selling ones are Racing, Sports, Strategy and Shooter.

Let’s look at the top selling titles per region.

Top 5 most sold titles in North America

##                        Name    Genre NA_Sales Global_Sales
## 1                Wii Sports   Sports    41.36        82.53
## 2            Mario Kart Wii   Racing    15.68        35.52
## 3         Wii Sports Resort   Sports    15.61        32.77
## 4        Kinect Adventures!     Misc    15.00        21.81
## 5 New Super Mario Bros. Wii Platform    14.44        28.32

Top 5 most sold titles in Europe

##                                           Name  Genre EU_Sales
## 1                                   Wii Sports Sports    28.96
## 2                               Mario Kart Wii Racing    12.76
## 3                            Wii Sports Resort Sports    10.93
## 4 Brain Age: Train Your Brain in Minutes a Day   Misc     9.20
## 5                                     Wii Play   Misc     9.18
##   Global_Sales
## 1        82.53
## 2        35.52
## 3        32.77
## 4        20.15
## 5        28.92

Top 5 most sold titles in Japan

##                                          Name      Genre JP_Sales
## 1                       New Super Mario Bros.   Platform     6.50
## 2                 Animal Crossing: Wild World Simulation     5.33
## 3 Brain Age 2: More Training in Minutes a Day     Puzzle     5.32
## 4                   New Super Mario Bros. Wii   Platform     4.70
## 5                   Animal Crossing: New Leaf Simulation     4.39
##   Global_Sales
## 1        29.80
## 2        12.13
## 3        15.29
## 4        28.32
## 5         9.16

While the top titles in North America and Europe are almost the same, Japan has a very different list. The genres of the top titles also vary significantly: in North America and Europe, Sports and Racing are at the top of the list, whereas in Japan it is the Platform and Simulation genres.
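Rankings like the regional top 5 lists can be obtained by sorting on the relevant sales column. Below is a small sketch with made-up data; the helper name `top_n_by` is hypothetical.

```r
# Toy data; names and sales figures are made up for illustration.
games <- data.frame(
  Name     = paste("Game", LETTERS[1:6]),
  NA_Sales = c(5.0, 1.2, 3.3, 0.4, 2.1, 0.9),
  JP_Sales = c(0.3, 2.5, 0.1, 1.8, 0.0, 0.6)
)

# Hypothetical helper: the n rows with the largest value in `column`.
top_n_by <- function(d, column, n = 5) {
  sorted <- d[order(d[[column]], decreasing = TRUE), ]
  sorted[seq_len(min(n, nrow(sorted))), ]
}

top_na <- top_n_by(games, "NA_Sales", n = 3)
top_jp <- top_n_by(games, "JP_Sales", n = 3)
```

Comparing `top_na$Name` with `top_jp$Name` then shows directly how much two regional lists overlap.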

What about acceptance by critics and user by genre?

It looks like the genres preferred by Critics and Users are not necessarily the best-selling ones. For instance, the Puzzle genre gets comparatively high critic scores but does not sell well. Strategy receives some of the best median user scores but is one of the worst genres in terms of sales (one possible explanation is that Strategy games are more common on PC and, as we saw earlier, PC games are among the worst-selling ones).

Proportion of bestsellers per genre

## # A tibble: 12 x 6
##    Genre        total_sales median_sales     n bestsellers best_share
##    <fct>              <dbl>        <dbl> <int>       <dbl>      <dbl>
##  1 Misc               417.         0.38    386          95      24.6 
##  2 Shooter            817.         0.34    868         203      23.4 
##  3 Platform           378.         0.35    403          94      23.3 
##  4 Sports             836.         0.38    951         195      20.5 
##  5 Racing             479.         0.27    586         117      20.0 
##  6 Fighting           250.         0.33    379          75      19.8 
##  7 Action            1206.         0.290  1644         309      18.8 
##  8 Puzzle              78.9        0.14    118          21      17.8 
##  9 Simulation         204.         0.3     302          53      17.6 
## 10 Role-Playing       502.         0.26    715         123      17.2 
## 11 Adventure           81.5        0.13    264          16       6.06
## 12 Strategy            70.1        0.09    278          15       5.4

The genres with the highest proportion of bestsellers are mostly in line with the previous findings about the best-selling genres, the top ones being Miscellaneous, Shooter and Platform.

Relationship between Critic and User Scores and Games Sales

There seems to be a positive correlation between Critic Score and Global Sales. The relationship is not exactly linear; there seems to be a slight exponential curve.

The relationship between User Score and Global Sales is also slightly positive, although much less pronounced than the relationship between Critic Score and Global Sales.

Let’s calculate Pearson correlation coefficient for Critic Score, User Score and Global Sales.

##              Global_Sales Critic_Score User_Score
## Global_Sales         1.00         0.24       0.09
## Critic_Score         0.24         1.00       0.58
## User_Score           0.09         0.58       1.00

As concluded earlier from the scatterplots, Critic Score has higher correlation (0.24) with Global Sales than User Score (0.09), making it a more useful metric to add to the prediction model.
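A correlation matrix of this shape can be computed with `cor()`. The sketch below runs on simulated data, so its numbers will not match the table above; the way the three variables are generated is purely an assumption for illustration.

```r
set.seed(7)
n <- 200

# Simulated scores and sales; not the real data.
critic <- rnorm(n, mean = 70, sd = 12)
user   <- 0.6 * critic + rnorm(n, mean = 28, sd = 10)
sales  <- exp(0.03 * critic + rnorm(n, mean = -3, sd = 1))

scores <- data.frame(
  Global_Sales = sales,
  Critic_Score = critic,
  User_Score   = user
)

# Pairwise Pearson correlations, rounded to two decimals.
cm <- round(cor(scores, method = "pearson"), 2)
cm
```

Because Pearson correlation measures linear association and the sales relationship looks curved, `method = "spearman"` or correlating against `log10(Global_Sales)` may be worth comparing.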

What is more, Critic Score and User Score seem to be positively correlated with one another (0.58), so User Score should probably be removed from the model to avoid multicollinearity.

Let’s look at the distribution of Critic Scores depending on whether the game is a bestseller or not.

## $`0`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   60.00   70.00   68.08   78.00   98.00 
## 
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.00   74.00   81.00   79.49   87.00   98.00

Best-selling games tend to receive higher Critic Scores (a mean of about 79 compared with about 68 for non-bestsellers). The distribution of Critic Scores is also narrower for bestsellers (Critics are more unanimous when it comes to best-selling games).

Bivariate Analysis

Importance of market adaptation

Some genres seem to sell better than others. However, it is important to take into account that genre preferences may vary by region. This is especially true for the Japanese market, which clearly has different genre preferences from the North American and European markets.

Correlation between critic score and global sales

The feature with the strongest correlation with the target variable seems to be the critic score. This is true both for Global Sales as the target variable (slight positive correlation, non-linear relationship) and for Bestseller as the target variable (bestsellers tend to have higher critic scores).

User score, on the other hand, has a weaker correlation with Global Sales, even though this might seem counterintuitive, since it is the end users after all who buy games.

Correlation between user score and critic score

Critic score and user score are positively correlated with each other. This relationship has to be taken into account when building a predictive model, since it is a potential cause of multicollinearity.

Multivariate Plots Section

Now that we know there is some correlation between critic score and sales, and that there are regional preferences for genres, let’s look at some other factors that might play a role in creating a bestseller.

Franchise brand name and game creators know-how

We can suppose that companies that are creating games are gradually becoming better at it, so the more games they release, the higher the sales per game.

Another possible assumption is that if a game became a bestseller, reusing the same franchise can be a factor for success, since users are already familiar with the brand and are more likely to buy a game if they liked the previous one in the series.

The Mario franchise was definitely a big success, with most of the released games achieving above-average global sales. What is curious, though, is that the original series and genres, the Super Mario series (Platform) and the Mario Kart series (Racing), proved to be much more popular than the subsequent attempts to bring the franchise into other genres, such as Sports, Puzzle or Role-Playing.

The Final Fantasy saga is a good example of the fact that a brand name and high-quality past games are not the sole recipe for success. While the first games of the saga (Final Fantasy VII and Final Fantasy VIII) were huge hits, the subsequent trend is decreasing, with most of the games from 2005 on selling below average.

Year of Release vs Platform

In the Final Fantasy Saga evolution, we can see that certain platforms appear and disappear with time. Let’s have a closer look at this relationship.

From the above graph we can clearly see the cycles of console generations, where the newest models replace the oldest ones, and therefore the latest games are produced for the newest console models. The platforms on the rise as of 2016 are XOne from Microsoft and PS4 from Sony.

Building the prediction model

Let’s fit a logistic regression model to predict whether a specific game will become a bestseller or not.

Let’s start by splitting the data into the training and testing sets and fitting a logistic regression with the following predictor variables:

  • Critic Score
  • User Score
  • Genre
  • Platform
  • Year of Release
## 
## Call:
## glm(formula = Bestseller ~ Critic_Score + User_Score + Genre + 
##     Platform + Year, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9299  -0.6170  -0.3513  -0.1391   3.6169  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -16.268892 324.744067  -0.050 0.960045    
## Critic_Score        0.110559   0.005051  21.889  < 2e-16 ***
## User_Score         -0.017174   0.004180  -4.109 3.98e-05 ***
## GenreAdventure     -1.182956   0.336491  -3.516 0.000439 ***
## GenreFighting      -0.282044   0.186075  -1.516 0.129582    
## GenreMisc           0.170572   0.174509   0.977 0.328350    
## GenrePlatform       0.117624   0.179483   0.655 0.512243    
## GenrePuzzle        -0.198313   0.306332  -0.647 0.517388    
## GenreRacing        -0.116706   0.160404  -0.728 0.466874    
## GenreRole-Playing  -0.419506   0.153122  -2.740 0.006150 ** 
## GenreShooter        0.137560   0.133606   1.030 0.303201    
## GenreSimulation     0.130330   0.213172   0.611 0.540946    
## GenreSports        -0.576819   0.134724  -4.281 1.86e-05 ***
## GenreStrategy      -1.310751   0.314492  -4.168 3.08e-05 ***
## PlatformDC         -1.992679   0.945844  -2.107 0.035137 *  
## PlatformDS          0.005264   0.355826   0.015 0.988196    
## PlatformGBA        -0.893862   0.422054  -2.118 0.034185 *  
## PlatformGC         -1.163700   0.413940  -2.811 0.004935 ** 
## PlatformPC         -1.971529   0.366671  -5.377 7.58e-08 ***
## PlatformPS         -0.798575   0.533424  -1.497 0.134374    
## PlatformPS2        -0.147572   0.365273  -0.404 0.686210    
## PlatformPS3         0.286110   0.321560   0.890 0.373596    
## PlatformPS4         1.296125   0.415212   3.122 0.001799 ** 
## PlatformPSP        -0.525334   0.373050  -1.408 0.159068    
## PlatformPSV        -1.201352   0.564828  -2.127 0.033425 *  
## PlatformWii         0.567358   0.345941   1.640 0.100997    
## PlatformWiiU        0.067380   0.452797   0.149 0.881705    
## PlatformX360        0.248401   0.324132   0.766 0.443464    
## PlatformXB         -1.718751   0.403356  -4.261 2.03e-05 ***
## PlatformXOne        0.974060   0.427663   2.278 0.022748 *  
## Year1992           -2.469574 459.257006  -0.005 0.995710    
## Year1996            9.807653 324.745201   0.030 0.975907    
## Year1997            9.538832 324.744603   0.029 0.976567    
## Year1998            9.265539 324.744453   0.029 0.977238    
## Year1999            9.141267 324.744345   0.028 0.977543    
## Year2000            8.119093 324.744059   0.025 0.980054    
## Year2001            8.891254 324.743882   0.027 0.978157    
## Year2002            8.666601 324.743869   0.027 0.978709    
## Year2003            8.626824 324.743865   0.027 0.978807    
## Year2004            8.923528 324.743860   0.027 0.978078    
## Year2005            8.082170 324.743861   0.025 0.980144    
## Year2006            8.009272 324.743851   0.025 0.980323    
## Year2007            8.337090 324.743835   0.026 0.979518    
## Year2008            8.463030 324.743837   0.026 0.979209    
## Year2009            7.732002 324.743853   0.024 0.981005    
## Year2010            8.331334 324.743847   0.026 0.979532    
## Year2011            8.011568 324.743852   0.025 0.980318    
## Year2012            8.089395 324.743862   0.025 0.980127    
## Year2013            8.071318 324.743877   0.025 0.980171    
## Year2014            7.607094 324.743939   0.023 0.981311    
## Year2015            7.130425 324.743991   0.022 0.982482    
## Year2016            6.187811 324.744040   0.019 0.984798    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5287.3  on 5514  degrees of freedom
## Residual deviance: 4097.1  on 5463  degrees of freedom
## AIC: 4201.1
## 
## Number of Fisher Scoring iterations: 11
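The split-and-fit step can be sketched as follows. The null deviance above has 5514 degrees of freedom, which implies 5515 training rows, or about 80% of the 6894 observations, so an 80/20 split is assumed here. The data is simulated, the seed is arbitrary, and only a subset of the predictors is included to keep the example short.

```r
set.seed(42)
n <- 500

# Simulated stand-in for the cleaned data; not the real values.
games <- data.frame(
  Critic_Score = round(runif(n, 30, 95)),
  User_Score   = round(runif(n, 30, 95)),
  Genre        = factor(sample(c("Action", "Sports", "Shooter"), n, replace = TRUE))
)
# Assumed data-generating process: only Critic_Score drives the outcome.
games$Bestseller <- rbinom(n, 1, plogis(-8 + 0.1 * games$Critic_Score))

# 80/20 train/test split (assumed proportion, see above).
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- games[train_idx, ]
test  <- games[-train_idx, ]

# Logistic regression: binomial family with the default logit link.
fit <- glm(Bestseller ~ Critic_Score + User_Score + Genre,
           family = "binomial", data = train)

# Predicted bestseller probabilities for the held-out games.
pred <- predict(fit, newdata = test, type = "response")
summary(fit)
```

In the real output the Year coefficients all have standard errors around 325, which is typical when a factor level has very few observations or almost no variation in the outcome.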

We see that there are a number of statistically significant predictor variables, such as Critic Score, User Score as well as some of the genres and platforms.

Now let’s check our model for multicollinearity using VIF to make sure we don’t have variables that are highly correlated with each other.

##                    GVIF Df GVIF^(1/(2*Df))
## Critic_Score   1.527401  1        1.235881
## User_Score     1.596800  1        1.263645
## Genre          1.583116 11        1.021101
## Platform     115.312168 16        1.159935
## Year          98.486223 22        1.109951

Platform and Year of Release have GVIF values far above 10 (115 and 98), which points to multicollinearity in our model, although the degrees-of-freedom-adjusted values in the last column (about 1.16 and 1.11) are much smaller and are the ones meant for comparing terms with different Df. Earlier we saw that year of release and platform are related, since the newest games tend to be released on the latest generation of consoles. We will drop Year of Release to remove one of the mutually correlated variables.
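The last column of the VIF table is derived from the first two, which can be checked with a few lines of arithmetic. The table has the layout printed by `vif()` from the car package (an assumption, since the call is not shown); the sketch below only reuses the printed numbers.

```r
# GVIF and Df values copied from the table above.
gvif <- c(Critic_Score = 1.527401, User_Score = 1.596800,
          Genre = 1.583116, Platform = 115.312168, Year = 98.486223)
df   <- c(Critic_Score = 1, User_Score = 1,
          Genre = 11, Platform = 16, Year = 22)

# Adjusted value: GVIF^(1 / (2 * Df)).
# For a term with Df = 1 this is simply sqrt(GVIF).
adjusted <- gvif^(1 / (2 * df))
round(adjusted, 6)

# Squaring puts the adjusted value back on the scale of an ordinary VIF.
round(adjusted^2, 2)
```

Multi-level factors such as Platform (16 Df) and Year (22 Df) inflate the raw GVIF simply by having many coefficients, which is why the adjusted column is the more comparable one across terms.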

## 
## Call:
## glm(formula = Bestseller ~ Critic_Score + User_Score + Genre + 
##     Platform, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8533  -0.6301  -0.3637  -0.1456   3.7026  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -8.519695   0.452287 -18.837  < 2e-16 ***
## Critic_Score       0.109401   0.004932  22.180  < 2e-16 ***
## User_Score        -0.015359   0.004009  -3.832 0.000127 ***
## GenreAdventure    -1.172063   0.333626  -3.513 0.000443 ***
## GenreFighting     -0.270969   0.184273  -1.470 0.141434    
## GenreMisc          0.225718   0.172627   1.308 0.191027    
## GenrePlatform      0.157199   0.176915   0.889 0.374243    
## GenrePuzzle       -0.136924   0.303074  -0.452 0.651423    
## GenreRacing       -0.068704   0.156666  -0.439 0.660996    
## GenreRole-Playing -0.388357   0.150842  -2.575 0.010036 *  
## GenreShooter       0.171692   0.131268   1.308 0.190889    
## GenreSimulation    0.229868   0.207391   1.108 0.267700    
## GenreSports       -0.523713   0.131739  -3.975 7.03e-05 ***
## GenreStrategy     -1.238508   0.312723  -3.960 7.48e-05 ***
## PlatformDC        -1.204736   0.860489  -1.400 0.161495    
## PlatformDS         0.389626   0.326289   1.194 0.232435    
## PlatformGBA       -0.042424   0.357210  -0.119 0.905461    
## PlatformGC        -0.403175   0.350678  -1.150 0.250267    
## PlatformPC        -1.629385   0.344163  -4.734 2.20e-06 ***
## PlatformPS         0.291387   0.368431   0.791 0.429010    
## PlatformPS2        0.566402   0.303879   1.864 0.062335 .  
## PlatformPS3        0.582168   0.307008   1.896 0.057926 .  
## PlatformPS4        0.477694   0.340350   1.404 0.160456    
## PlatformPSP       -0.194509   0.343374  -0.566 0.571079    
## PlatformPSV       -1.222290   0.555320  -2.201 0.027732 *  
## PlatformWii        0.902364   0.320502   2.815 0.004871 ** 
## PlatformWiiU      -0.042814   0.436672  -0.098 0.921896    
## PlatformX360       0.560877   0.306871   1.828 0.067590 .  
## PlatformXB        -0.975440   0.339820  -2.870 0.004099 ** 
## PlatformXOne       0.203193   0.363202   0.559 0.575856    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5287.3  on 5514  degrees of freedom
## Residual deviance: 4177.9  on 5485  degrees of freedom
## AIC: 4237.9
## 
## Number of Fisher Scoring iterations: 6

From the model coefficients, we can see that each one-point increase in critic score multiplies the odds of a game being a bestseller by about 1.12 (the exponential of the Critic Score coefficient, 0.109401), holding the other predictors constant.
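The conversion from a logistic regression coefficient to an odds ratio can be sketched as follows. This is a Python illustration rather than the original R code; the variable names are assumptions, and the coefficient is copied from the model summary above.

```python
import math

# Critic_Score estimate copied from the model summary above.
critic_coef = 0.109401

# Exponentiating a logit coefficient gives the multiplicative change in the odds
# for a one-unit increase in the predictor, other predictors held constant.
odds_ratio = math.exp(critic_coef)
print(f"Odds ratio per critic point: {odds_ratio:.4f}")

# The effect compounds: a ten-point increase multiplies the odds by exp(10 * coef).
odds_ratio_10 = math.exp(10 * critic_coef)
print(f"Odds ratio per ten critic points: {odds_ratio_10:.4f}")
```

Note that this factor applies to the odds, not to the probability itself, so the change in probability depends on where the game starts on the logistic curve.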

## [1] "Accuracy 0.820159535895576"

Our model correctly classified 82% of the test set. Let’s build a confusion matrix to see in more detail which kinds of errors the model is prone to.

## Confusion Matrix and Statistics
## 
##               
## fitted.results    0    1
##              0 1048  211
##              1   37   83
##                                           
##                Accuracy : 0.8202          
##                  95% CI : (0.7989, 0.8401)
##     No Information Rate : 0.7868          
##     P-Value [Acc > NIR] : 0.00116         
##                                           
##                   Kappa : 0.3165          
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.28231         
##             Specificity : 0.96590         
##          Pos Pred Value : 0.69167         
##          Neg Pred Value : 0.83241         
##              Prevalence : 0.21320         
##          Detection Rate : 0.06019         
##    Detection Prevalence : 0.08702         
##       Balanced Accuracy : 0.62411         
##                                           
##        'Positive' Class : 1               
## 

The confusion matrix reveals a much higher proportion of false negatives (15.3% of the test set: 211 games that are bestsellers but were classified as non-bestsellers) than false positives (2.7% of the test set: 37 games that are not bestsellers but were classified as bestsellers). Since we want to be conservative in our predictions, this is preferable to having a high proportion of false positives.
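The error shares quoted above follow directly from the four cells of the confusion matrix. The sketch below (in Python, with illustrative variable names) recomputes them, along with the accuracy and the no-information rate reported in the output, which is the accuracy of always predicting "not a bestseller".

```python
# Counts copied from the confusion matrix (rows: predicted, columns: actual).
tn, fn = 1048, 211  # predicted 0: actually 0, actually 1
fp, tp = 37, 83     # predicted 1: actually 0, actually 1

total = tn + fn + fp + tp

# Shares of the whole test set, as quoted in the text.
false_negative_share = fn / total
false_positive_share = fp / total

# Accuracy versus the baseline of always predicting the majority class (0).
accuracy = (tn + tp) / total
no_information_rate = (tn + fp) / total

print(f"Test set size:         {total}")
print(f"False negative share:  {false_negative_share:.3f}")
print(f"False positive share:  {false_positive_share:.3f}")
print(f"Accuracy:              {accuracy:.4f}")
print(f"No-information rate:   {no_information_rate:.4f}")
```

Comparing the last two numbers puts the 82% accuracy in context: the majority-class baseline alone already reaches about 79%.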

## Precision:  0.6916667
##  Recall:  0.2823129

The high proportion of false negatives results in a low recall (also known as sensitivity) of 28.2%, meaning that of all the games that are bestsellers, only 28.2% were correctly classified as such. The rest were incorrectly classified as non-bestsellers.

Specificity is quite high (96.6%), meaning that most of the non-bestsellers are correctly classified as such.

The model's precision is 69.2%, meaning that of all the games classified as bestsellers, 69.2% actually are bestsellers.
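The three rates discussed above differ only in which cells of the confusion matrix they divide by. A minimal Python sketch, assuming the counts from the matrix and using a hypothetical helper named `classification_metrics`:

```python
def classification_metrics(tn, fn, fp, tp):
    """Compute recall, specificity and precision from confusion-matrix counts."""
    return {
        # Of all actual bestsellers, the fraction the model found.
        "recall": tp / (tp + fn),
        # Of all actual non-bestsellers, the fraction correctly rejected.
        "specificity": tn / (tn + fp),
        # Of all games predicted to be bestsellers, the fraction that really are.
        "precision": tp / (tp + fp),
    }


# Counts copied from the confusion matrix above.
metrics = classification_metrics(tn=1048, fn=211, fp=37, tp=83)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```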

Which recently released games will become bestsellers

Let’s see what predictions our model gives for the following recently released games:

##                      Name Platform        Genre Critic_Score User_Score
## 1   Red Dead Redemption 2      PS4       Action           97         78
## 2 Spyro Reignited Trilogy     XOne     Platform           82         33
## 3              Fallout 76       PC Role-Playing           59         28
##                      Name    predict bestseller
## 1   Red Dead Redemption 2 0.79768716          1
## 2 Spyro Reignited Trilogy 0.57560291          1
## 3              Fallout 76 0.01084852          0

Our model predicts that Red Dead Redemption 2 for PS4 and Spyro Reignited Trilogy for XOne will be bestsellers, while Fallout 76 for PC is classified as a non-bestseller. Looking at the probabilities, although both Red Dead Redemption 2 and Spyro Reignited Trilogy are classified as bestsellers, the model is much more confident about the first one (a probability of 0.80 versus 0.58).
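These probabilities can be reproduced by hand from the coefficients of the reduced model. The Python sketch below is not the original R code: the function name is hypothetical, only the coefficients needed for these three games are copied from the summary, Action is treated as the reference genre because it has no coefficient of its own, and the 0.5 classification cut-off is an assumption that is consistent with the table.

```python
import math

# Coefficients copied from the reduced model summary (only those needed here).
INTERCEPT = -8.519695
CRITIC = 0.109401
USER = -0.015359
# Action has no row in the summary, so it is assumed to be the reference level.
GENRE = {"Action": 0.0, "Platform": 0.157199, "Role-Playing": -0.388357}
PLATFORM = {"PS4": 0.477694, "XOne": 0.203193, "PC": -1.629385}

THRESHOLD = 0.5  # assumed cut-off for labelling a game a bestseller


def bestseller_probability(platform, genre, critic_score, user_score):
    """Apply the logistic function to the linear predictor of the model."""
    logit = (INTERCEPT
             + CRITIC * critic_score
             + USER * user_score
             + GENRE[genre]
             + PLATFORM[platform])
    return 1.0 / (1.0 + math.exp(-logit))


# Inputs copied from the table of recently released games.
games = [
    ("Red Dead Redemption 2", "PS4", "Action", 97, 78),
    ("Spyro Reignited Trilogy", "XOne", "Platform", 82, 33),
    ("Fallout 76", "PC", "Role-Playing", 59, 28),
]

for name, platform, genre, critic, user in games:
    p = bestseller_probability(platform, genre, critic, user)
    print(f"{name:<24} predict={p:.4f} bestseller={int(p > THRESHOLD)}")
```

Because the published coefficients are rounded to six decimals, the results agree with the table only up to small rounding differences.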

We will have to wait for some months to see whether our predictions have turned out to be true.

Multivariate Analysis

Franchise brand name and game creators’ know-how

While some franchises and developers are definitely more successful than others, using a brand name does not guarantee success. Some sagas started strong but became less popular over time (Final Fantasy), while others were highly popular in some genres but failed to expand successfully into others (the Super Mario franchise).

Platform vs Year of Release

There is a high correlation between Platform and Year of Release which makes sense, as the newest games are primarily released for the latest console generations. Due to this correlation, Year of Release was removed from the model to avoid multicollinearity.

Predictive model

A logistic regression model was created to predict whether a game will be a bestseller based on its Genre, Platform, Critic Score and User Score.

The model’s prediction accuracy is 82%, with high specificity (96.6%) and low sensitivity (28.2%). The main weakness of the model is its high false negative rate, meaning that many games that are actually bestsellers are classified as non-bestsellers.


Final Plots and Summary

Critic Score vs Global Sales

Understanding the relationship between critic score and global sales is important, since critic score is one of the main predictor variables in our predictive model. The scatterplot and the fitted line show a slight positive correlation, meaning that games with higher critic scores tend to sell more copies.

Genre preferences by market

When analyzing a game’s performance it is important to take market preferences into account. Selling a game in the US market is not the same as selling it in the Japanese market, and this becomes even clearer when we look at the distribution of sales by genre in each region.

Some of the best-selling genres in the US (Sports and Shooter) are among the worst-selling in Japan, whereas Puzzle, a genre that is unpopular among American gamers, sells quite well in Japan compared to other genres.

Super Mario franchise evolution

Another important aspect of the games market is the power of the brand name and the developer’s know-how. Classic franchises such as Super Mario have been a huge success for the past two decades, producing dozens of titles for different platforms and expanding into various genres. However, even such hits can have their highs and lows, and figuring out the target audiences and their preferences for genres and platforms is important even if you are Nintendo.

The graph shows all titles from the Super Mario franchise released between 2000 and 2016. The colour of each dot indicates the genre, and its size reflects the critic score the title received. The titles are ordered by year of release, and the vertical axis shows their global sales. We can see which titles were more successful in terms of sales and in terms of critical reception. We can also see that, for the Super Mario franchise, the Platform and Racing genres sell better than Sports or Puzzle.


Reflection

Whether video games are a form of art or just a source of entertainment is a long-standing debate. But I was interested in taking a more analytical approach to what it takes to make a great game. Is it the creators themselves and their artistic skills and know-how? Or maybe it is the brand name of a franchise that translates into high sales? These and many other questions about the games industry drove my analysis.

The dataset was taken from Kaggle and combines two datasets from different websites dedicated to games (vgchartz.com and Metacritic.com). This is important to keep in mind for the first stage of the analysis, where the data was cleaned to remove missing values and incorrect formats. The source of the data also conditions the conclusions we can draw from it. Since the audience of these websites is primarily from the US, data points such as critic score and user score most probably reflect the American audience. The selection of the games themselves is also affected by this bias, since local games that are popular in regions other than the US are likely to be underrepresented in the sample.

The exploratory analysis of the data revealed some interesting insights. I was surprised to find that critic score is more highly correlated with global sales than user score. Even though users might like a game, it does not necessarily mean they are willing to pay for it (which happens a lot with Strategy games, which tend to be more popular on PC and are more prone to piracy because of that). It was also interesting to see how the data analysis revealed some differences between markets.

Finally, a model was built to predict whether a specific game is a bestseller based on the critic and user scores, the genre and the platform. While an accuracy of 82% was achieved with little tweaking of the model, there is a lot of potential for improvement. Some limitations of this model include the source of the data discussed above (the games sample is biased towards the US market) and the number of entries with missing data.

Potential improvements can be made by adding more variables to the model (for instance Publisher and Developer), the main obstacle being the number of different publishers and developers represented in the dataset. Another possible direction of analysis would be to group games by their franchises (implicit in the game’s name) and see whether this affects the accuracy of predictions.

The analysis could also be replicated for other markets by scraping local websites dedicated to games. Other machine learning models (for example a Random Forest Classifier) could be applied to the data to see whether the predictions are more accurate.


List of References